Prosper Loan Data Exploration

by Johnson Yang

Preliminary Wrangling

This document explores a dataset containing about 110,000 loan listings and their related attributes for Prosper, the first peer-to-peer lending marketplace in United States.

It seems that both the official API and the Prosper Data Dictionary showed that a 'N/A' value would be used to represent the credit grade of the loan entries with no ProsperRating (Alpha) value assigned to them, which is essentially the same as the "NC" value used in the pre-2009 CreditGrade column.

We will fill the ProsperRating (Alpha) column of these loan entries with the value "NC" instead to provide a uniform representation of the credit ratings in our analysis. As a result, the column ProsperRating (numeric) should also be updated accordingly.

What is the structure of the dataset?

There are 113,937 loan listings in the dataset with 81 features without applying any data wrangling procedure. Among these features, 61 of them are numerical variables, 3 are boolean variables, and 17 are string variables. A closer look reveals that 6 of these string variables are actually datetime values, 4 of these string variables are key columns used in relational database tables, leaving only 7 categorical variables in the dataset for analysis.

What is/are the main feature(s) of interest in the dataset?

The main interest of this report is the observation of loan listings with their LoanStatus codes labeled as either "Chargedoff" or "Defaulted". We are interested in finding the characteristics of these "bad" loan listings because these loans will only bring negative effects to all stakeholders (the borrowers, the investors, and Prosper itself) in the loan system, and we want to spot these bad loans when we see them (if possible).

We will try to choose the features which can be used to describe the financial status of a borrower. Following is the list of the features that will be used in the analysis:

Categorical

Boolean

Numerical

Additional derived feature(s) may be engineered later when needed.

What features in the dataset do you think will help support your investigation into the feature(s) of interest?

In most cases, each loan usually links to a borrower who is short on funds (or to put it simply, the expenditure is greater than the income for some reasons). And therefore we expect features that can directly connect to the borrowers' situations may provide support to the investigation (eg. Occupation, IncomeRange). Other indirect features may or may not show supportive results, and these features will also be inspected in the upcoming paragraphs.

Univariate Exploration

The contents of this section will be focused on the distribution of features. We will start by looking at the distribution of LoanStatus, the main variable of interest.

There are 11,992 loans which were considered "Chargedoff", and 5,018 loans which were labeled as "Defaulted". If we convert these loan counts to percent proportions, this would mean the overall "Chargedoff" rate is 10.5251%, and the overall "Defaulted" rate is 4.4042%.

Next we will check the distribution of the Boolean variables in the dataset.

Looks like it's almost a 50-50 chance for whether a borrower in the dataset is a home owner or not.

There are more than 90% of borrowers whose income information is considered verifiable. Notice that it is not required for a borrower to have the income information to be verifiable in order to post a loan listing.

Next we will check the distribution of the categorical variables in the dataset.

The CreditGrade column represents the credit rating of a loan posting originated before July 2009. During this period, most of the loans were graded either C or D.

The ProsperRating (Alpha) column represents the credit rating of a loan posting originated from and after July 2009. In contrast to pre-2009 period, most of the loans were graded between A to D after Prosper reopened their website.

There are 58,308 loans which the borrowers selected 'Debt Consolidation' as their purposes, and these loans took 51.1756% of the total loans. Also, notice there are also sizeable numbers of loans using 'Not Available' and 'Other' as their purposes, which is similar to anonymizing the loan purposes.

The EmploymentStatus column contains some overlapping information about a borrower's employment status when the loan posting was created. The "Employed" status essentially covers all employment status types by its own meaning, and it is too vague to be used as a way to outline the status of a borrower. However, the date of the first loan posting using these different EmploymentStatus codes may give us some hints.

Unlike other employment status codes, both "Employed" and "Other" were being adopted by Prosper starting from August 2010 (after Prosper reopened their website). This means Prosper decided to let the borrowers to give less personal info when applying for a loan, and 67,322 borrowers in total decided to benefit from this freedom.

The IncomeRange column showed that most of the loan postings were coming from the borrowers with low to moderate incomes. There are 32,192 loans with the borrowers' income range between 25,000-49,999 USD, and 31,050 loans with the borrowers' income range between 50,000-74,999 USD.

The Occupation column contains detailed information of a borrower's actual job type. Notice there are some occupations such as "Other" and "Professional" that are inexact, and unlike the value "Employed" in the EmploymentStatus column, the number of the borrowers who chose to hide their occupations is not as significant.

Finally we will check the distribution of the numerical variables in the dataset.

There are three peaks in the histogram of the LoanOriginalAmount column, namely

, indicating these may be the typical origination amounts of the loan in Prosper marketplace.

The histogram of the DebtToIncomeRatio column showed a wide blank area in x-axis. We will try to divide the original plot to two separate subplots, one with DebtToIncomeRatio <= 1 and another with DebtToIncomeRatio > 1.

The first subplot showed most of the loans have a DebtToIncomeRatio less than or equal to 1, with the peak of the distribution situated between 0.1 to 0.3. The second plot showed there relatively few loans which have a DebtToIncomeRatio greater than 1, meaning that most of the borrowers are aware of controlling their debt-to-income ratio when applying for a loan.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

There are 272 loans with their DebtToIncomeRatio equal to 10.01, this is because Prosper will internally process any debt to income ratio larger than 1000% to make them returned as a capped value of 1001%.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

We would create a lite version of the current LoanStatus variable and use it as the main variable of interest in the rest of the report. The motivation of this adjustment is provided below.

After checking the online docs of the Prosper API, it seems that some of loan status codes could be further reduced to make the LoanStatus column simplified. But before we wrangle this variable, we will see the loan status definition of the current API first:

We can see there are only 6 loan status codes in use ("5 = FINALPAYMENTINPROGRESS" is officially declared as deprecated), instead of the 12 possible LoanStatus values in the dataset. All the "Past Due" loan status codes would be considered as "Defaulted" in the current API, with the default reason marked as "1 = Delinquency", making the representation of LoanStatus cleaner and easier to understand.

However, there is no clear explanation about why "FinalPaymentInProgress" is marked as deprecated, but we can still try to figure it out by doing some reasoning:

In this project, "FinalPaymentInProgress" can be counted toward either "Current" or "Completed", as these status codes would have no effect to our analysis.

And following are the same plots we created earlier for the LoanStatus variable, but this time we will use the lite version of the same variable instead.

Bivariate Exploration

In this section we will inspect the relationships between the overall chargedoff/defaulted rates and various features of the loans we introduced earlier.

We will start from the Boolean variables in the dataset, just like what we did in the previous section.

It is confirmed that the borrowers who are not a homeowner are more likely to fail to repay their loans, though the overall rate differences are not very significant.

The relationship between overall chargedoff rate and verifiable incomes is anticipated, but the overall defaulted rate shows a different story. Perhaps this is due to the difference between a home and an income source: the former is an asset, and the latter may or may not be an asset (income sources could be a job, an investment, a property transfer, etc.), making this feature less reliable than the ownership of a home.

Next we will inspect relations between the overall chargedoff/defaulted rates and the categorical variables in the dataset.

Loans originated before July 2009 suffered from high chargedoff/defaulted rates (even loans with "AA" credit grade have a 9.6609% overall chargedoff rate), and these rates skyrocketed as the credit ratings of loans moved downward. The poor loan performance forced Prosper to obtain SEC registration for its loans and relaunch their service in July 2009.

Starting from July 2009, the overall chargedoff/defaulted rates of originated loans were pacified - at least for loans with moderate to good credit ratings (AA to C). For loans with lower credit ratings (D and below), they are still prone to defaults.

For loans that seem to come from borrowers with stronger financial strength (eg. buying vehicles, wedding/engagement, vacation, etc.), it is expected to see these loans have a lower overall chargedoff/defaulted rate. However, also notice that loans with certain purposes may have an overall chargedoff rate higher than 20%, and these may be red flags for either Prosper or the investors.

Remember the "Employed" and "Other" types were introduced by Prosper after they reopened their website in July 2009, and these types would certainly have a lower overall chargedoff/defaulted rate. But also remember there are 67,322 borrowers who prefer not to specify their employment status by choosing "Employed" instead of other exact types. Due to these reasons we would consider this feature as a biased variable and would not investigate it further.

The IncomeRange variable, one of the features that closely describes the financial status of a borrower, displays a trend that is similar to the one we see in either the CreditGrade or the ProsperRating (Alpha) features. The overall chargedoff/defaulted rates arise as the income range of the borrowers move downward.

The long, detailed list of borrower occupation showed some interesting facts. On the top side of the spectrum, we can see Judges and Dentists have a brilliant overall chargedoff rate of 0%. And on the bottom side, we can see most of the occupations with an overall chargedoff rate higher than 20% are Students. Though this finding can serve as an instant warning signal for the investors, this may also imply some ongoing problem in the US society, as students in US are being plagued by the negative cycle of loan repayment.

Lastly we will inspect relations between the overall chargedoff/defaulted rates and the numerical variables in the dataset.

It's interesting to see that most of the loans which have been labeled as chargedoff/defaulted have a loan amount value which falls in the range of 1,000 ~ 6,000 USD. Looks like the magnitude of the debt burden does not always connect to the likelihood of a borrower's failure to make payments regularly.

Also, notice there is no chargedoff/defaulted loan occurred when LoanOriginalAmount is greater than 25,000 USD.

Most of the chargedoff/defaulted loans in the dataset have their DebtToIncomeRatio values lower than 0.5. It seems that this finding aligns with the finding of the LoanOriginalAmount variable we see earlier, indicating there is a possibility that most of the borrowers in Prosper marketplace are weak in financial strength - perhaps too weak to even successfully repay their debts.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

When analyzing relationships between chargedoff/defaulted rates and several categorical variables, we can confirm some of these variables return some results that could be predicted by the common sense. For example, both the occupations and the income ranges showed expected trend of chargedoff/defaulted rate variations, and it is notable to see features related to credit ratings support the results we observed. However, since these are overall rates, it would be prudent to observe how the chargedoff/defaulted rates change under various faceted conditions.

On the other hand, the analysis of numerical variables presented us with the implication that most of the borrowers in Prosper marketplace are likely to be financially weak, and this may not only be an overall outcome, but also be a recurring, persistent phenomenon. For now, the possible way to detect this situation still remains to be seen.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Although this has nothing to do with the chargedoff/defaulted rates and other features we selected, we still feel inclined to create a line chart to display the growth of loan listing counts of every year. Because unlike ordinary businesses, the growth of loan cases has meanings more than merely the possible success and revenue of Prosper - it can also be interpreted as the increase of people who are financially weak and cannot but borrow money to keep their lives going.

The line chart showed the loan counts increased steadily after year 2011. The sudden fallback of year 2014 is due to the lack of data (our dataset ends at 2014-03-12), and we will create a second line chart to display the growth of loan counts of every quarter to remedy this problem.

When we split the unit of time from years to quarters, it becomes apparent that the number of loans per quarter increased sharply after 2013 Q1, and keeps on breaking highest record until 2014 Q1 (again, the data of 2014 Q1 is not a complete quarter anyway). Now we know there is indeed an escalation of borrowers (and maybe, people who are financially weak), perhaps we can try to create some adapted line plots to help us inspect the phenomenon we mentioned earlier.

Multivariate Exploration

In this section we will concentrate on the following topics derived from the previous section:

During the period before July 2009 website reopened, both the chargedoff and defaulted loan rates were rampant. Only a few number of occupations with AA credit ratings could be considered as reliable investments.

After July 2009 Prosper relaunch, both the chargedoff and defaulted loan rates improved dramatically, making the Prosper marketplace to become a stable platform of peer-to-peer lending. However, we should still pay attention to some occupations that have a high chargedoff/defaulted rate even when they carry a good Prosper Rating such as AA or A.

Unlike the credit rating columns which separate the dataset to two mutual exclusive portions, the income ranges accept data generated before and after July 2009 as a whole. And from the heatmap we created above, we can see it is obvious that the computed chargedoff/defaulted rates are heavily influenced by data generated before July 2009 (similar to a mean value being affected by outliers). We will try to create the same heatmaps using only data generated after Prosper reopened their website and see the differences.

We can see that after July 2009, there is only a few occupations with their income ranges belong to either "\$0" or "Not Employed" appeared in the heatmap, which means Prosper tightened their measures of credit ratings and risk control after their relaunch.

Next we will examine the growth trends of loans by credit ratings and income ranges.

After July 2009 relaunch, the D rating loans were the most common loan listings in the Prosper marketplace ranging from 2010 Q3 to 2011 Q4. Starting from 2012 Q1, the C rating loans becomes the mainstream of the loan marketplace, followed by the B rating loans.

In constrast to the growth trends of loans by credit ratings, the growth trends of loans by income ranges are less fluctuating. We can see that even after July 2009 relaunch, the borrowers with their income ranges fall in either "\$25,000-49,999" or "\\$50,000,74,999" always remained as the primary groups of borrowers in the marketplace.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

When we pair either credit ratings or income ranges together with the chargedoff/default rates, not only can we rule out many overlooked risky occupations (those with high chargedoff/defaulted rates even when they carry AA or A ratings or with high income ranges), but also we can confirm the same trend observed in previous sections. For example:

Were there any interesting or surprising interactions between features?

It is revealed that most of the borrowers in the marketplace are people with low to moderate incomes, and this is not only an overall result (the result we already saw in the section of univariate exploration), but also a persistent ongoing trend, indicating that the Prosper marketplace is always a potentially risky choice of investment, because most of the debtors in the market are financially weak.